OBJECTIVE AND DATA DESCRIPTION

Background:

Banks lend money to customers to finance a car or a house, pay for education, consolidate debt, and so on. Borrowers agree to repay the principal with interest in monthly installments. Sometimes, due to unexpected circumstances, borrowers are unable to repay.

Research Objectives:

Banks want to find patterns in customer data that predict whether a customer will repay a loan, so they know whom to lend to. The goal here is to predict, for a new applicant, whether the bank should approve the loan, based on a model learned from this dataset.

Research Questions:

  1. Which factors contribute to, or correlate most strongly with, loan status?

  2. Can we predict whether a borrower will repay the debt in full?

  3. Which machine learning algorithms perform best at this prediction (first guess)?

  4. After tuning the hyperparameters of all (or the best) algorithms, what is the best prediction performance, and which algorithm achieves it?

Dataset

The dataset is taken from Kaggle (https://www.kaggle.com/zaurbegiev/my-dataset). There are 100,000 rows and 19 columns (features) in this dataset. The target variable is Loan_Status, a categorical variable whose value is either “Fully Paid” or “Charged Off”. Fully Paid means the borrower repaid the debt in full, while Charged Off means the borrower is unlikely to pay the bank back after a substantial period of delinquency. The remainder of the debt is sometimes collected by a third-party agency.

The 19 columns are: LoanID, CustomerID, Loan_Status, Current_Loan_Amount, Term, Credit_Score, Annual_Income, Years_in_current_job, Home_Ownership, Purpose, Monthly_Debt, Years_of_Credit_History, Months_since_last_delinquent, Number_of_Open_Accounts, Number_of_Credit_Problems, Current_Credit_Balance, Maximum_Open_Credit, Bankruptcies, Tax_Liens.

STEP 1: LOADING LIBRARIES AND DATA

library(caret)
library(dplyr)
library(tidyverse)
library(mlbench)
library(gmodels)
# for multiple plots in one figure (ggarrange)
library(ggpubr)
library(ggplot2)
library(clustertend)
library(factoextra)
library(corrplot)
library(cluster)
library(magrittr)
library(fpc)
library(pheatmap)
library(clValid)

theme_set(theme_classic())

Examining the data

# Clear memory
rm(list=ls())

# read in data
BankLoan_dataset <- read.csv("LoanStatus.csv")

# remove irrelevant identifier fields
BankLoan_dataset <- subset(BankLoan_dataset, select=-c(LoanID, CustomerID))

# data structure (str() prints its output and returns NULL, so there is
# nothing for knitr::kable() to format)
str(BankLoan_dataset)
## 'data.frame':    100000 obs. of  17 variables:
##  $ Loan_Status                 : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
##  $ Current_Loan_Amount         : int  445412 262328 99999999 347666 176220 206602 217646 648714 548746 215952 ...
##  $ Term                        : chr  "Short Term" "Short Term" "Short Term" "Long Term" ...
##  $ Credit_Score                : int  709 NA 741 721 NA 7290 730 NA 678 739 ...
##  $ Annual_Income               : int  1167493 NA 2231892 806949 NA 896857 1184194 NA 2559110 1454735 ...
##  $ Years_in_current_job        : chr  "8 years" "10+ years" "8 years" "3 years" ...
##  $ Home_Ownership              : chr  "Home Mortgage" "Home Mortgage" "Own Home" "Own Home" ...
##  $ Purpose                     : chr  "Home Improvements" "Debt Consolidation" "Debt Consolidation" "Debt Consolidation" ...
##  $ Monthly_Debt                : num  5215 33296 29201 8742 20640 ...
##  $ Years_of_Credit_History     : num  17.2 21.1 14.9 12 6.1 17.3 19.6 8.2 22.6 13.9 ...
##  $ Months_since_last_delinquent: int  NA 8 29 NA NA NA 10 8 33 NA ...
##  $ Number_of_Open_Accounts     : int  6 35 18 9 15 6 13 15 4 20 ...
##  $ Number_of_Credit_Problems   : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ Current_Credit_Balance      : int  228190 229976 297996 256329 253460 215308 122170 193306 437171 669560 ...
##  $ Maximum_Open_Credit         : int  416746 850784 750090 386958 427174 272448 272052 864204 555038 1021460 ...
##  $ Bankruptcies                : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Tax_Liens                   : int  0 0 0 0 0 0 0 0 0 0 ...


# data summary
knitr::kable(head(BankLoan_dataset))
(first six rows, transposed for readability)

Variable                       Row 1              Row 2               Row 3               Row 4               Row 5               Row 6
Loan_Status                    Fully Paid         Fully Paid          Fully Paid          Fully Paid          Fully Paid          Charged Off
Current_Loan_Amount            445412             262328              99999999            347666              176220              206602
Term                           Short Term         Short Term          Short Term          Long Term           Short Term          Short Term
Credit_Score                   709                NA                  741                 721                 NA                  7290
Annual_Income                  1167493            NA                  2231892             806949              NA                  896857
Years_in_current_job           8 years            10+ years           8 years             3 years             5 years             10+ years
Home_Ownership                 Home Mortgage      Home Mortgage       Own Home            Own Home            Rent                Home Mortgage
Purpose                        Home Improvements  Debt Consolidation  Debt Consolidation  Debt Consolidation  Debt Consolidation  Debt Consolidation
Monthly_Debt                   5214.74            33295.98            29200.53            8741.90             20639.70            16367.74
Years_of_Credit_History        17.2               21.1                14.9                12.0                6.1                 17.3
Months_since_last_delinquent   NA                 8                   29                  NA                  NA                  NA
Number_of_Open_Accounts        6                  35                  18                  9                   15                  6
Number_of_Credit_Problems      1                  0                   1                   0                   0                   0
Current_Credit_Balance         228190             229976              297996              256329              253460              215308
Maximum_Open_Credit            416746             850784              750090              386958              427174              272448
Bankruptcies                   1                  0                   0                   0                   0                   0
Tax_Liens                      0                  0                   0                   0                   0                   0
knitr::kable(summary(BankLoan_dataset))
(transposed for readability; Loan_Status, Term, Years_in_current_job, Home_Ownership and Purpose are character columns of length 100,000)

Variable                       Min     1st Qu.   Median    Mean       3rd Qu.   Max          NA's
Current_Loan_Amount            10802   179652    312246    11760447   524942    99999999
Credit_Score                   585     705       724       1076       741       7510         19154
Annual_Income                  76627   848844    1174162   1378277    1650663   165557393    19154
Monthly_Debt                   0       10214     16220     18472      24012     435843
Years_of_Credit_History        3.6     13.5      16.9      18.2       21.7      70.5
Months_since_last_delinquent   0       16        32        34.9       51        176          53141
Number_of_Open_Accounts        0       8         10        11.13      14        76
Number_of_Credit_Problems      0       0         0         0.1683     0         15
Current_Credit_Balance         0       112670    209817    294637     367959    32878968
Maximum_Open_Credit            0       273400    467900    760800     783000    1540000000   2
Bankruptcies                   0       0         0         0.1177     0         7            204
Tax_Liens                      0       0         0         0.02931    0         15           10

STEP 2: DATA PREPROCESSING

a) Convert to numeric values

# Extract the numeric part of strings like "8 years" or "10+ years".
# (tidyr::extract_numeric() is deprecated; readr::parse_number(), loaded with
# tidyverse, replaces it and already returns numeric, so no as.numeric() call is needed.)
BankLoan_dataset$Years_in_current_job <- readr::parse_number(BankLoan_dataset$Years_in_current_job)

b) Convert Categorical Variables

c) Remove Outliers

d) Remove Null

# Naive approach, kept commented out (see the explanation below):
# # identify outliers
# outliers <- boxplot(BankLoan_dataset$Annual_Income, plot=FALSE)$out
# # remove outliers
# BankLoan_dataset <- BankLoan_dataset[-which(BankLoan_dataset$Annual_Income %in% outliers), ]

We cannot remove outliers one feature at a time, because each removal drops a different set of rows; the features would no longer line up if filtered separately and then recombined. Instead, we chain all of the filters on the same data frame with the pipe operator (%>%), as below.
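The repeated quantile/IQR filters could also be factored into a reusable helper. A minimal sketch, assuming Tukey's 1.5×IQR fences; the function name `filter_iqr` and the toy data are illustrative, not part of the original analysis:

```r
# Keep rows whose value for `col` lies inside the Tukey fences
# [Q1 - k*IQR, Q3 + k*IQR]. Assumes the column has no NAs (drop_na() first).
filter_iqr <- function(df, col, k = 1.5) {
  x   <- df[[col]]
  q   <- quantile(x, c(0.25, 0.75))
  rng <- q + c(-1, 1) * k * IQR(x)
  df[x >= rng[1] & x <= rng[2], , drop = FALSE]
}

# Toy example: 1000 is an obvious outlier and gets dropped
toy <- data.frame(income = c(10, 12, 11, 13, 1000))
nrow(filter_iqr(toy, "income"))  # 4
```

Applied repeatedly (or in a loop over column names), this reproduces the same effect as the chained filter() calls above with less duplication.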

# show outliers before removing them
histogram(BankLoan_dataset$Annual_Income)

bankloan <- BankLoan_dataset %>%
  # convert string feature into categorical factors
  mutate_if(is.character, as.factor) %>% 
  
  # remove the nulls
  drop_na() %>%

  ### Remove outliers
  # Annual_Income
  filter(between(Annual_Income, 
                 quantile(Annual_Income, 0.25) - 1.5* IQR(Annual_Income),
                 quantile(Annual_Income, 0.75) + 1.5* IQR(Annual_Income))) %>%
  # Current_Loan_Amount
  filter(between(Current_Loan_Amount, 
                 quantile(Current_Loan_Amount, 0.25) - 1.5* IQR(Current_Loan_Amount),
                 quantile(Current_Loan_Amount, 0.75) + 1.5* IQR(Current_Loan_Amount))) %>%
  # Credit_Score
  filter(between(Credit_Score, 
                 quantile(Credit_Score, 0.25) - 1.5* IQR(Credit_Score),
                 quantile(Credit_Score, 0.75) + 1.5* IQR(Credit_Score))) %>%
  #Number_of_Credit_Problems (there is an outlier of 15)
  filter(between(Number_of_Credit_Problems, 
                 quantile(Number_of_Credit_Problems, 0.25) - 1.5* IQR(Number_of_Credit_Problems),
                 4)) %>%
  # Monthly_Debt
  filter(between(Monthly_Debt, 
               quantile(Monthly_Debt, 0.25) - 1.5* IQR(Monthly_Debt),
               quantile(Monthly_Debt, 0.75) + 1.5* IQR(Monthly_Debt))) %>%

  # Maximum_Open_Credit
  filter(between(Maximum_Open_Credit, 
               quantile(Maximum_Open_Credit, 0.25) - 1.5* IQR(Maximum_Open_Credit),
               quantile(Maximum_Open_Credit, 0.75) + 1.5* IQR(Maximum_Open_Credit))) %>%
    
  # drop_na() again as a safety net (the filters above should not introduce NAs)
  drop_na()

str(bankloan)
## 'data.frame':    25185 obs. of  17 variables:
##  $ Loan_Status                 : Factor w/ 2 levels "Charged Off",..: 2 2 2 1 2 2 1 2 2 1 ...
##  $ Current_Loan_Amount         : int  217646 548746 234124 317108 465410 449108 688468 129712 287980 374176 ...
##  $ Term                        : Factor w/ 2 levels "Long Term","Short Term": 2 2 2 1 1 2 1 2 2 1 ...
##  $ Credit_Score                : int  730 678 727 687 688 718 682 723 737 652 ...
##  $ Annual_Income               : int  1184194 2559110 693234 1133274 1722654 1454507 1494616 1465698 1013954 1239199 ...
##  $ Years_in_current_job        : num  1 2 10 8 3 8 1 10 1 10 ...
##  $ Home_Ownership              : Factor w/ 4 levels "HaveMortgage",..: 2 4 4 4 4 2 4 3 2 2 ...
##  $ Purpose                     : Factor w/ 16 levels "Business Loan",..: 4 4 4 4 3 4 4 4 4 10 ...
##  $ Monthly_Debt                : num  10855 18660 14211 9633 15647 ...
##  $ Years_of_Credit_History     : num  19.6 22.6 24.7 17.4 22.3 28.8 16.6 19.4 18.6 36.6 ...
##  $ Months_since_last_delinquent: int  10 33 46 53 30 21 50 6 13 42 ...
##  $ Number_of_Open_Accounts     : int  13 4 10 4 7 14 8 34 11 10 ...
##  $ Number_of_Credit_Problems   : int  1 0 1 0 0 0 0 1 0 0 ...
##  $ Current_Credit_Balance      : int  122170 437171 28291 60287 107559 193990 343995 45106 223117 126350 ...
##  $ Maximum_Open_Credit         : int  272052 555038 107052 126940 488356 458414 843854 163218 489302 415602 ...
##  $ Bankruptcies                : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ Tax_Liens                   : int  0 0 0 0 0 0 0 0 0 0 ...


knitr::kable(summary(bankloan))
Factor counts:
  Loan_Status:    Charged Off 4628, Fully Paid 20557
  Term:           Long Term 6672, Short Term 18513
  Home_Ownership: HaveMortgage 59, Home Mortgage 12444, Own Home 2175, Rent 10507
  Purpose:        Debt Consolidation 19640, other 1633, Home Improvements 1555, Other 812, Business Loan 357, Medical Bills 312, (Other) 876

Numeric summaries (transposed for readability):

Variable                       Min     1st Qu.   Median    Mean      3rd Qu.   Max
Current_Loan_Amount            21450   160336    259886    289590    392436    789096
Credit_Score                   647     702       719       715.1     734       751
Annual_Income                  111245  882911    1185961   1277038   1580040   2960447
Years_in_current_job           1       3         6         6.201     10        10
Monthly_Debt                   0       10559     15993     17067     22590     43110
Years_of_Credit_History        3.80    14.20     17.40     18.64     21.90     70.50
Months_since_last_delinquent   0       17        32        35.36     52        176
Number_of_Open_Accounts        2       8         10        11.04     13        48
Number_of_Credit_Problems      0       0         0         0.1873    0         4
Current_Credit_Balance         0       93841     170221    210122    285703    2028877
Maximum_Open_Credit            0       231880    384648    448406    613426    1302114
Bankruptcies                   0       0         0         0.1241    0         4
Tax_Liens                      0       0         0         0.02974   0         4
# check that no missing values remain
anyNA(bankloan)
## [1] FALSE
# showing no outliers
histogram(bankloan$Annual_Income)

The histogram shows the outliers have been removed; annual income is still right-skewed. After removing outliers and null values, 25,185 observations remain to work with.
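The remaining right skew can also be quantified numerically. A small sketch, using a hand-rolled skewness helper and a simulated lognormal sample as a stand-in for income (both are illustrative assumptions, not the project's data):

```r
# Moment-based sample skewness (illustrative helper)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
x <- rlnorm(10000, meanlog = 14, sdlog = 0.5)  # lognormal, income-like scale

skew(x)       # clearly positive: long right tail
skew(log(x))  # near zero: the log transform roughly symmetrizes it
```

A log transform (or the Box-Cox transform used with preProcess() later) is a common way to tame this kind of skew before modeling.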

STEP 3: STATISTICAL SUMMARY

a) Graphical Summary

# BAR graph for categorical variables

ggplot(bankloan, aes(x=Purpose)) +
  geom_bar() +
  coord_flip()

gg_status <- ggplot(bankloan, aes(x=Loan_Status)) +
  geom_bar()

gg_home <- ggplot(bankloan, aes(x=Home_Ownership)) +
  geom_bar() +
  coord_flip()

gg_problem <- ggplot(bankloan, aes(x=Number_of_Credit_Problems)) +
  geom_bar() 

gg_job <- ggplot(bankloan, aes(x=Years_in_current_job)) +
  geom_bar() 

# arrange multiple plots in one figure  
figure <- ggarrange(gg_status, gg_home, gg_problem, gg_job,
                    ncol = 2, nrow = 2,
                    legend="none")
figure

# Boxplot for quantitative variables
ggplot(bankloan, aes(x=Loan_Status, y=Annual_Income)) +
  geom_boxplot()

ggplot(bankloan, aes(x=Loan_Status, y=Current_Loan_Amount)) +
  geom_boxplot()

ggplot(bankloan, aes(x=Loan_Status, y=Credit_Score)) +
  geom_boxplot()

ggplot(bankloan, aes(x=Loan_Status, y=Monthly_Debt)) +
  geom_boxplot()

From the bar graphs we can see that the data are imbalanced: Fully Paid far outnumbers Charged Off. Home Mortgage is the most common home-ownership category, followed by Rent. Most borrowers have zero credit problems, most have spent 10+ years in their current job, and the most common loan purpose is Debt Consolidation.

From the boxplots comparing annual income, current loan amount, monthly debt, and credit score, annual income shows the largest separation between Fully Paid and Charged Off customers.

b) Numerical Summary

# proportion of Paid-fully and Charged-off
table(bankloan$Loan_Status)
## 
## Charged Off  Fully Paid 
##        4628       20557
round(prop.table(table(bankloan$Loan_Status)) * 100, 1)
## 
## Charged Off  Fully Paid 
##        18.4        81.6
# Medians grouped by loan status
# (Credit_Score is column 4, Annual_Income is column 5)
aggregate(bankloan[, 4:5], list(bankloan$Loan_Status), median)
##       Group.1 Credit_Score Annual_Income
## 1 Charged Off          718       1116250
## 2  Fully Paid          719       1212561

The numerical summary shows that Fully Paid makes up 81.6% of the data. The median annual income is about 1.21M for Fully Paid customers and 1.12M for Charged Off customers. Since annual income is right-skewed, the median is used instead of the mean.
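The same group medians can also be computed with dplyr (already loaded), selecting columns by name rather than by position. A sketch on a hypothetical toy frame standing in for `bankloan`:

```r
library(dplyr)

# Toy stand-in for the cleaned bankloan data (hypothetical values)
toy <- data.frame(
  Loan_Status   = c("Fully Paid", "Fully Paid", "Charged Off", "Charged Off"),
  Credit_Score  = c(720, 730, 715, 718),
  Annual_Income = c(1.2e6, 1.3e6, 1.0e6, 1.1e6))

grp_medians <- toy %>%
  group_by(Loan_Status) %>%
  summarise(median_score  = median(Credit_Score),
            median_income = median(Annual_Income))
grp_medians
```

Naming the columns makes the intent explicit and protects the code if the column order ever changes.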

STEP 4: PRINCIPAL COMPONENT ANALYSIS (PCA)

a) Balancing Data by Downsampling

set.seed(9650)
# Divide data into train and test sets
indexTrain <- createDataPartition(y=bankloan$Loan_Status, p=0.75, list=FALSE) 
training <- bankloan[indexTrain, ]
testing <- bankloan[-indexTrain, ]

# The training split is still large, so take a subset of the data for speed and visual clarity
training <- training[1:2000, ]
testing <- testing[1:500, ]

# Downsample to balance the classes. Note: Loan_Status is column 1, so
# "-ncol(...)" drops the last column (Tax_Liens) rather than the target;
# downSample() then appends the outcome again as a "Class" column.
train <- downSample(x=training[, -ncol(training)], y=training$Loan_Status)
test <- downSample(x=testing[, -ncol(testing)], y=testing$Loan_Status)

table(train$Loan_Status)
## 
## Charged Off  Fully Paid 
##         375         375
table(test$Loan_Status)
## 
## Charged Off  Fully Paid 
##          93          93

The classes are now balanced after downsampling.
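Downsampling balances the classes by discarding majority-class rows. caret's upSample() is the mirror-image option: it replicates minority-class rows instead, so no data is thrown away. A toy sketch (the data frame here is illustrative, not the loan data):

```r
library(caret)

set.seed(9650)
# Imbalanced toy data: 20 "Charged Off" vs 80 "Fully Paid"
toy <- data.frame(x = rnorm(100),
                  y = factor(rep(c("Charged Off", "Fully Paid"),
                                 times = c(20, 80))))

# upSample() resamples the minority class and appends the outcome as "Class"
up <- upSample(x = toy["x"], y = toy$y)
table(up$Class)   # both classes now have 80 rows
```

Upsampling preserves all majority-class information at the cost of duplicated minority rows, which can encourage overfitting; downsampling does the reverse trade-off.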

b) Applying PCA with Linear Model

# Using balanced data. Taking only quantitative variables
trainQ <- train[, c(2,4,5,6,9,10,11,12,14,15)]
testQ <- test[, c(2,4,5,6,9,10,11,12,14,15)]

# Transform the data and apply PCA, with Annual_Income as the response.
# (preProcess() has no 'Comp' argument; the number of components is controlled
# by 'pcaComp' or 'thresh'. The default thresh=0.95 keeps 8 components here.)
preProc <- preProcess(trainQ[,-3], method=c('BoxCox', 'center', 'scale', 'pca'))
# Applying PCA, predict train data without predicted feature
train_pca <- predict(preProc, trainQ[, -3])
# Adding column that needs to be predicted
train_pca$Annual_Income <- trainQ$Annual_Income
# Examine the transformed training data
head(train_pca)
##          PC1         PC2         PC3        PC4        PC5        PC6
## 1 -1.2384075  0.35990852  0.65806763  0.2956382 -0.2399866 -2.1875899
## 2  0.6974771 -1.36486957 -2.09931766  1.4259874  0.2323596 -1.1441729
## 3  0.6328001  0.02077489 -0.04028476  1.1319218 -1.2986672  0.1996309
## 4  1.8143158 -0.26663688 -0.69619182 -2.4765150 -1.3897335 -0.4726096
## 5 -0.7621477  2.33690630  0.45145079  0.5155694  0.4110872  0.2193689
## 6  2.8200037 -0.59302518 -0.44706622  0.4545443  0.7041540 -0.1431837
##           PC7         PC8 Annual_Income
## 1 -1.19400314 -0.18413220       1839485
## 2  0.13067386 -0.52625933       1407900
## 3 -0.41993545  1.24339543        912475
## 4  1.46893381  0.91780031        597493
## 5  0.22624107 -0.53556678       1630998
## 6  0.01013493 -0.08170745        590254
# Applying PCA, predict test data without predicted feature
test_pca <- predict(preProc, testQ[, -3])
# Adding column that needs to be predicted
test_pca$Annual_Income <- testQ$Annual_Income
head(test_pca)
##          PC1        PC2        PC3          PC4          PC5          PC6
## 1  2.2459600 -1.8836049 -1.0282400  0.463054216 -0.853784239 -0.218438214
## 2  1.4532355  0.4486031 -1.6191810 -0.177268346  1.853533874 -0.460450740
## 3 -2.7134421 -0.7239373 -0.3312994 -2.223264592  0.145007336  0.004597723
## 4  0.8443699 -0.2447363  0.4355276 -0.389615315 -1.701711222  0.093582695
## 5  0.1342537  0.2137294  0.4170945  0.002490191 -0.003199088 -0.993774020
## 6  0.7799616 -2.9695812 -0.7796513  0.097079307 -1.140649816  1.294890401
##           PC7        PC8 Annual_Income
## 1  0.91988009 -1.0360133        653904
## 2  0.94072197 -0.3827557        929575
## 3  0.02907999 -0.7162236       1143363
## 4 -0.88834340 -0.5001129       1188127
## 5 -0.09279641  0.7185523       1398096
## 6 -0.23816557  0.9481315       1045000
# Fitting linear model using components
fit <- lm(Annual_Income~., data=train_pca)
print(fit$coefficients)
## (Intercept)         PC1         PC2         PC3         PC4         PC5 
##  1231675.86  -134713.29    73955.85   -35581.26    35790.16   -11429.15 
##         PC6         PC7         PC8 
##   -73420.25  -159213.59   -91150.09
summary(fit)
## 
## Call:
## lm(formula = Annual_Income ~ ., data = train_pca)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -755216 -303966  -80171  215266 1621992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1231676      15225  80.898  < 2e-16 ***
## PC1          -134713       9186 -14.665  < 2e-16 ***
## PC2            73956      13777   5.368 1.07e-07 ***
## PC3           -35581      14334  -2.482   0.0133 *  
## PC4            35790      15162   2.360   0.0185 *  
## PC5           -11429      16918  -0.676   0.4995    
## PC6           -73420      17992  -4.081 4.98e-05 ***
## PC7          -159214      19058  -8.354 3.24e-16 ***
## PC8           -91150      20975  -4.346 1.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 417000 on 741 degrees of freedom
## Multiple R-squared:  0.3278, Adjusted R-squared:  0.3206 
## F-statistic: 45.17 on 8 and 741 DF,  p-value: < 2.2e-16
# Predicting on test set
predictions <- predict(fit, test_pca)
head(predictions)
##         1         2         3         4         5         6 
##  816741.6 1018085.2 1534549.3 1269987.2 1236923.1  807664.6
# Evaluation, root mean square error
rmse <- RMSE(predictions, testQ$Annual_Income)
rmse
## [1] 402142.3
error_rate <- rmse/mean(testQ$Annual_Income)
error_rate
## [1] 0.3246781

Judging from the p-values, every principal component except PC5 is statistically significant. Prediction with PCA gives an RMSE of 402,142, i.e. about 32.5% of the mean income. The RMSE looks large in absolute terms because Annual_Income is measured in large units. The model explains about 33% of the variance (R² = 0.328).
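For context on that 32.5% relative error, it helps to compare against a naive baseline that always predicts the mean training income; the PCA model only adds value if it clearly beats this. A sketch with simulated income vectors (the lognormal parameters are illustrative assumptions, not the project's data):

```r
set.seed(42)
# Simulated right-skewed incomes standing in for trainQ/testQ$Annual_Income
train_income <- rlnorm(750, meanlog = 14, sdlog = 0.4)
test_income  <- rlnorm(190, meanlog = 14, sdlog = 0.4)

# Do-nothing baseline: always predict the training-set mean
baseline_pred <- rep(mean(train_income), length(test_income))
baseline_rmse <- sqrt(mean((test_income - baseline_pred)^2))

rel_err <- baseline_rmse / mean(test_income)
rel_err   # relative error of the baseline, for comparison with the model's
```

If the PCA+lm relative error is not well below the baseline's, the components carry little predictive signal for income.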

c) Visualization of PCA with Evaluation Statistics

1. Using Imbalanced Data

# Take a subset of quantitative variables to illustrate the concepts, and only
# 300 observations because the full data are too dense to visualize
bankloan_sub <- bankloan[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')]

Status_sub <- bankloan[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
##  Annual_Income      Monthly_Debt   Months_since_last_delinquent
##  Min.   : 371564   Min.   : 1006   Min.   : 2.00               
##  1st Qu.: 874385   1st Qu.: 9742   1st Qu.:19.00               
##  Median :1199594   Median :14782   Median :36.00               
##  Mean   :1271442   Mean   :16435   Mean   :37.65               
##  3rd Qu.:1580159   3rd Qu.:22311   3rd Qu.:54.00               
##  Max.   :2913270   Max.   :42775   Max.   :81.00               
##  Maximum_Open_Credit
##  Min.   :  47432    
##  1st Qu.: 229515    
##  Median : 377564    
##  Mean   : 446304    
##  3rd Qu.: 613261    
##  Max.   :1252878
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
##  Annual_Income      Monthly_Debt     Months_since_last_delinquent
##  Min.   :-1.7116   Min.   :-1.7341   Min.   :-1.67735            
##  1st Qu.:-0.7552   1st Qu.:-0.7522   1st Qu.:-0.87749            
##  Median :-0.1367   Median :-0.1858   Median :-0.07763            
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000            
##  3rd Qu.: 0.5872   3rd Qu.: 0.6605   3rd Qu.: 0.76928            
##  Max.   : 3.1229   Max.   : 2.9606   Max.   : 2.03964            
##  Maximum_Open_Credit
##  Min.   :-1.4138    
##  1st Qu.:-0.7684    
##  Median :-0.2436    
##  Mean   : 0.0000    
##  3rd Qu.: 0.5918    
##  Max.   : 2.8588
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups

fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')

# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)

summary(PCA_bankloan)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3602 1.0258 0.8073 0.6678
## Proportion of Variance 0.4625 0.2631 0.1629 0.1115
## Cumulative Proportion  0.4625 0.7256 0.8885 1.0000
print(PCA_bankloan) # same as rotation/direction
## Standard deviations (1, .., p=4):
## [1] 1.3601776 1.0257757 0.8073244 0.6677787
## 
## Rotation (n x k) = (4 x 4):
##                                     PC1        PC2        PC3         PC4
## Annual_Income                0.58744397 -0.1669828  0.5055315 -0.60947866
## Monthly_Debt                 0.62100519 -0.1107894  0.1577300  0.75973647
## Months_since_last_delinquent 0.01269062  0.9330071  0.3558844  0.05179782
## Maximum_Open_Credit          0.51874954  0.2988983 -0.7700035 -0.22057480
PCA_bankloan$center # at the origin
##                Annual_Income                 Monthly_Debt 
##                 1.390092e-16                 2.575486e-17 
## Months_since_last_delinquent          Maximum_Open_Credit 
##                 7.039508e-17                 8.977194e-17
PCA_bankloan$rotation
##                                     PC1        PC2        PC3         PC4
## Annual_Income                0.58744397 -0.1669828  0.5055315 -0.60947866
## Monthly_Debt                 0.62100519 -0.1107894  0.1577300  0.75973647
## Months_since_last_delinquent 0.01269062  0.9330071  0.3558844  0.05179782
## Maximum_Open_Credit          0.51874954  0.2988983 -0.7700035 -0.22057480
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp

# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1

eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2

eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2

# Quantitative
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
##                                    Dim.1     Dim.2     Dim.3      Dim.4
## Annual_Income                34.50904123  2.788325 25.556209 37.1464239
## Monthly_Debt                 38.56474505  1.227429  2.487876 57.7199503
## Months_since_last_delinquent  0.01610518 87.050224 12.665370  0.2683014
## Maximum_Open_Credit          26.91010853  8.934022 59.290545  4.8653244

In the plot of individuals, the Charged Off and Fully Paid clusters overlap heavily. Taking only four predictor features with the original unbalanced data, PC1 accounts for 46.3% of the variation and PC2 for 26.3%. Monthly_Debt and Annual_Income each contribute more than 25% to PC1, while Months_since_last_delinquent contributes about 87% of PC2.
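The contribution percentages reported by get_pca_var() can be recovered directly from the loadings: contrib = 100 * loading^2 / sum(loading^2) for that component. A quick check using the PC1 loadings printed above:

```r
# PC1 loadings from the prcomp output above
rot1 <- c(Annual_Income                = 0.58744397,
          Monthly_Debt                 = 0.62100519,
          Months_since_last_delinquent = 0.01269062,
          Maximum_Open_Credit          = 0.51874954)

# For prcomp, each loading vector has unit length, so sum(rot1^2) is ~1
contrib1 <- 100 * rot1^2 / sum(rot1^2)
round(contrib1, 2)  # 34.51, 38.56, 0.02, 26.91 -- matches Dim.1 of var$contrib
```

This makes explicit why Monthly_Debt and Annual_Income dominate PC1: their squared loadings account for roughly 73% of that component.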

2. Improvement: Using Balanced Data

# Shuffling data
train <- train[sample(nrow(train)), ]

# Using the downsampled balanced data (train). Take a subset of quantitative
# variables to illustrate the concepts.
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')]

Status_sub <- train[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
##  Annual_Income      Monthly_Debt     Months_since_last_delinquent
##  Min.   : 284354   Min.   :  280.6   Min.   : 1.00               
##  1st Qu.: 855508   1st Qu.:10369.1   1st Qu.:18.00               
##  Median :1127289   Median :15891.0   Median :35.00               
##  Mean   :1218081   Mean   :16610.6   Mean   :36.74               
##  3rd Qu.:1536758   3rd Qu.:21993.4   3rd Qu.:54.00               
##  Max.   :2880210   Max.   :42506.2   Max.   :83.00               
##  Maximum_Open_Credit
##  Min.   :      0    
##  1st Qu.: 240432    
##  Median : 375100    
##  Mean   : 437638    
##  3rd Qu.: 587950    
##  Max.   :1212134
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
##  Annual_Income      Monthly_Debt      Months_since_last_delinquent
##  Min.   :-1.8834   Min.   :-1.89061   Min.   :-1.61011            
##  1st Qu.:-0.7313   1st Qu.:-0.72261   1st Qu.:-0.84418            
##  Median :-0.1831   Median :-0.08331   Median :-0.07825            
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000            
##  3rd Qu.: 0.6428   3rd Qu.: 0.62319   3rd Qu.: 0.77780            
##  Max.   : 3.3527   Max.   : 2.99808   Max.   : 2.08439            
##  Maximum_Open_Credit
##  Min.   :-1.6458    
##  1st Qu.:-0.7416    
##  Median :-0.2352    
##  Mean   : 0.0000    
##  3rd Qu.: 0.5653    
##  Max.   : 2.9126
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups

fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')

# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)

summary(PCA_bankloan)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.3417 1.0074 0.8657 0.6598
## Proportion of Variance 0.4501 0.2537 0.1874 0.1089
## Cumulative Proportion  0.4501 0.7038 0.8911 1.0000
print(PCA_bankloan) # same as rotation/direction
## Standard deviations (1, .., p=4):
## [1] 1.3417459 1.0074132 0.8657078 0.6598384
## 
## Rotation (n x k) = (4 x 4):
##                                     PC1          PC2        PC3         PC4
## Annual_Income                 0.6100994 -0.069384108 -0.4138129 -0.67210379
## Monthly_Debt                  0.6311028  0.005781102 -0.2583915  0.73137519
## Months_since_last_delinquent -0.1224274  0.940644848 -0.3162529 -0.01352347
## Maximum_Open_Credit           0.4631410  0.332174192  0.8136196 -0.11482137
PCA_bankloan$center # at the origin
##                Annual_Income                 Monthly_Debt 
##                 1.694594e-16                 1.510019e-16 
## Months_since_last_delinquent          Maximum_Open_Credit 
##                 9.095155e-17                 5.027807e-17
PCA_bankloan$rotation
##                                     PC1          PC2        PC3         PC4
## Annual_Income                 0.6100994 -0.069384108 -0.4138129 -0.67210379
## Monthly_Debt                  0.6311028  0.005781102 -0.2583915  0.73137519
## Months_since_last_delinquent -0.1224274  0.940644848 -0.3162529 -0.01352347
## Maximum_Open_Credit           0.4631410  0.332174192  0.8136196 -0.11482137
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp

# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1

eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2

eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2

# Quantitative
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
##                                  Dim.1        Dim.2     Dim.3       Dim.4
## Annual_Income                37.222123  0.481415442 17.124112 45.17235024
## Monthly_Debt                 39.829074  0.003342114  6.676617 53.49096664
## Months_since_last_delinquent  1.498848 88.481273071 10.001591  0.01828842
## Maximum_Open_Credit          21.449956 11.033969373 66.197680  1.31839471

With the balanced data, the Charged Off and Fully Paid clusters still overlap. PC1 accounts for 45.0% of the variation and PC2 for 25.4%, slightly less than with the original unbalanced data, which is somewhat surprising. Monthly_Debt and Annual_Income each contribute more than 25% to PC1.

3. Improvement: Using All Features

# Using balanced data. Taking all quantitative variables
bankloan_sub <- train[1:300, c(2,4,5,6,9,10,11,12,14,15)]

Status_sub <- train[1:300, 'Loan_Status']
# Summarizing the data
summary(bankloan_sub)
##  Current_Loan_Amount  Credit_Score   Annual_Income     Years_in_current_job
##  Min.   : 31108      Min.   :648.0   Min.   : 284354   Min.   : 1.000      
##  1st Qu.:177452      1st Qu.:696.8   1st Qu.: 855508   1st Qu.: 2.750      
##  Median :271865      Median :716.0   Median :1127289   Median : 6.000      
##  Mean   :305396      Mean   :712.2   Mean   :1218081   Mean   : 5.987      
##  3rd Qu.:394735      3rd Qu.:733.0   3rd Qu.:1536758   3rd Qu.:10.000      
##  Max.   :786478      Max.   :749.0   Max.   :2880210   Max.   :10.000      
##   Monthly_Debt     Years_of_Credit_History Months_since_last_delinquent
##  Min.   :  280.6   Min.   : 7.90           Min.   : 1.00               
##  1st Qu.:10369.1   1st Qu.:14.50           1st Qu.:18.00               
##  Median :15891.0   Median :17.00           Median :35.00               
##  Mean   :16610.6   Mean   :18.45           Mean   :36.74               
##  3rd Qu.:21993.4   3rd Qu.:21.30           3rd Qu.:54.00               
##  Max.   :42506.2   Max.   :46.10           Max.   :83.00               
##  Number_of_Open_Accounts Current_Credit_Balance Maximum_Open_Credit
##  Min.   : 2.00           Min.   :     0         Min.   :      0    
##  1st Qu.: 8.00           1st Qu.:109858         1st Qu.: 240432    
##  Median :10.00           Median :180405         Median : 375100    
##  Mean   :10.92           Mean   :208824         Mean   : 437638    
##  3rd Qu.:13.00           3rd Qu.:288600         3rd Qu.: 587950    
##  Max.   :32.00           Max.   :842593         Max.   :1212134
# Scale the data
bankloan_scaled <- scale(bankloan_sub)
summary(bankloan_scaled)
##  Current_Loan_Amount  Credit_Score     Annual_Income     Years_in_current_job
##  Min.   :-1.5968     Min.   :-2.6124   Min.   :-1.8834   Min.   :-1.418733   
##  1st Qu.:-0.7449     1st Qu.:-0.6285   1st Qu.:-0.7313   1st Qu.:-0.920849   
##  Median :-0.1952     Median : 0.1549   Median :-0.1831   Median : 0.003793   
##  Mean   : 0.0000     Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000   
##  3rd Qu.: 0.5201     3rd Qu.: 0.8468   3rd Qu.: 0.6428   3rd Qu.: 1.141814   
##  Max.   : 2.8008     Max.   : 1.4979   Max.   : 3.3527   Max.   : 1.141814   
##   Monthly_Debt      Years_of_Credit_History Months_since_last_delinquent
##  Min.   :-1.89061   Min.   :-1.7619         Min.   :-1.61011            
##  1st Qu.:-0.72261   1st Qu.:-0.6594         1st Qu.:-0.84418            
##  Median :-0.08331   Median :-0.2417         Median :-0.07825            
##  Mean   : 0.00000   Mean   : 0.0000         Mean   : 0.00000            
##  3rd Qu.: 0.62319   3rd Qu.: 0.4766         3rd Qu.: 0.77780            
##  Max.   : 2.99808   Max.   : 4.6195         Max.   : 2.08439            
##  Number_of_Open_Accounts Current_Credit_Balance Maximum_Open_Credit
##  Min.   :-2.0487         Min.   :-1.4423        Min.   :-1.6458    
##  1st Qu.:-0.6707         1st Qu.:-0.6835        1st Qu.:-0.7416    
##  Median :-0.2113         Median :-0.1963        Median :-0.2352    
##  Mean   : 0.0000         Mean   : 0.0000        Mean   : 0.0000    
##  3rd Qu.: 0.4777         3rd Qu.: 0.5510        3rd Qu.: 0.5653    
##  Max.   : 4.8416         Max.   : 4.3772        Max.   : 2.9126
# Creating and printing Principal Components using function prcomp
# habillage: an optional factor variable for coloring the observations by groups

fviz_pca_ind(prcomp(bankloan_scaled), title='Bank Loan Status Data', habillage=Status_sub, palette='jco', geom='point', ggtheme=theme_classic(), legend='bottom')

# Counting variation captured by each Principal Component
PCA_bankloan <- prcomp(bankloan_scaled)

summary(PCA_bankloan)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7862 1.1417 1.0337 0.97749 0.95933 0.89286 0.87047
## Proportion of Variance 0.3191 0.1303 0.1069 0.09555 0.09203 0.07972 0.07577
## Cumulative Proportion  0.3191 0.4494 0.5563 0.65181 0.74384 0.82356 0.89933
##                            PC8     PC9    PC10
## Standard deviation     0.65689 0.62685 0.42689
## Proportion of Variance 0.04315 0.03929 0.01822
## Cumulative Proportion  0.94248 0.98178 1.00000
print(PCA_bankloan) # prints the standard deviations and the rotation (loadings) matrix
## Standard deviations (1, .., p=10):
##  [1] 1.7862493 1.1416702 1.0336827 0.9774934 0.9593270 0.8928578 0.8704715
##  [8] 0.6568943 0.6268464 0.4268943
## 
## Rotation (n x k) = (10 x 10):
##                                      PC1        PC2         PC3          PC4
## Current_Loan_Amount           0.40701343 -0.1952349  0.23488119  0.133885112
## Credit_Score                 -0.19309394  0.2471000 -0.70633231 -0.177892866
## Annual_Income                 0.33934497 -0.2673582 -0.35350511  0.373712612
## Years_in_current_job          0.16912414 -0.3672567 -0.06053996 -0.057292875
## Monthly_Debt                  0.40576503 -0.0664915 -0.22989823  0.384267013
## Years_of_Credit_History       0.20183701 -0.3025326  0.10877495 -0.661801819
## Months_since_last_delinquent -0.03719702  0.4276409  0.45420744  0.310033556
## Number_of_Open_Accounts       0.26241882  0.4694794 -0.16585772 -0.006102478
## Current_Credit_Balance        0.45266551  0.2292062  0.12257523 -0.220384525
## Maximum_Open_Credit           0.41564415  0.3777523 -0.06141242 -0.277696137
##                                      PC5         PC6          PC7         PC8
## Current_Loan_Amount          -0.05175622  0.05069197 -0.328710079 -0.71398603
## Credit_Score                  0.30859806 -0.23513688 -0.234633903 -0.23517423
## Annual_Income                 0.05442396 -0.38245289  0.009252672 -0.08505729
## Years_in_current_job          0.74958516  0.49747815  0.093894358  0.10469261
## Monthly_Debt                 -0.11325493 -0.11758078  0.223550562  0.39811129
## Years_of_Credit_History       0.04781816 -0.48510199  0.404106290 -0.07577986
## Months_since_last_delinquent  0.55571560 -0.43131846  0.120436130 -0.03961992
## Number_of_Open_Accounts      -0.09853217  0.32660982  0.641716849 -0.33484611
## Current_Credit_Balance       -0.01344433 -0.02000199 -0.315681161  0.36483591
## Maximum_Open_Credit           0.05853400  0.07827771 -0.297923035  0.07542986
##                                       PC9         PC10
## Current_Loan_Amount          -0.316310269  0.001445027
## Credit_Score                 -0.313783047  0.108458262
## Annual_Income                 0.610180369  0.140943219
## Years_in_current_job         -0.001624696  0.020741613
## Monthly_Debt                 -0.577299394 -0.249048829
## Years_of_Credit_History      -0.087639247 -0.059888041
## Months_since_last_delinquent -0.039218127 -0.029091854
## Number_of_Open_Accounts       0.080439561  0.190798187
## Current_Credit_Balance       -0.054694992  0.665306097
## Maximum_Open_Credit           0.277866761 -0.649957790
PCA_bankloan$center # column means used for centering (all ~0 because the data were pre-scaled)
##          Current_Loan_Amount                 Credit_Score 
##                 8.777701e-17                -2.095060e-15 
##                Annual_Income         Years_in_current_job 
##                 1.694594e-16                 1.250533e-16 
##                 Monthly_Debt      Years_of_Credit_History 
##                 1.510019e-16                 1.277855e-16 
## Months_since_last_delinquent      Number_of_Open_Accounts 
##                 9.095155e-17                 1.324172e-17 
##       Current_Credit_Balance          Maximum_Open_Credit 
##                 3.416249e-17                 5.027807e-17
PCA_bankloan$rotation
##                                      PC1        PC2         PC3          PC4
## Current_Loan_Amount           0.40701343 -0.1952349  0.23488119  0.133885112
## Credit_Score                 -0.19309394  0.2471000 -0.70633231 -0.177892866
## Annual_Income                 0.33934497 -0.2673582 -0.35350511  0.373712612
## Years_in_current_job          0.16912414 -0.3672567 -0.06053996 -0.057292875
## Monthly_Debt                  0.40576503 -0.0664915 -0.22989823  0.384267013
## Years_of_Credit_History       0.20183701 -0.3025326  0.10877495 -0.661801819
## Months_since_last_delinquent -0.03719702  0.4276409  0.45420744  0.310033556
## Number_of_Open_Accounts       0.26241882  0.4694794 -0.16585772 -0.006102478
## Current_Credit_Balance        0.45266551  0.2292062  0.12257523 -0.220384525
## Maximum_Open_Credit           0.41564415  0.3777523 -0.06141242 -0.277696137
##                                      PC5         PC6          PC7         PC8
## Current_Loan_Amount          -0.05175622  0.05069197 -0.328710079 -0.71398603
## Credit_Score                  0.30859806 -0.23513688 -0.234633903 -0.23517423
## Annual_Income                 0.05442396 -0.38245289  0.009252672 -0.08505729
## Years_in_current_job          0.74958516  0.49747815  0.093894358  0.10469261
## Monthly_Debt                 -0.11325493 -0.11758078  0.223550562  0.39811129
## Years_of_Credit_History       0.04781816 -0.48510199  0.404106290 -0.07577986
## Months_since_last_delinquent  0.55571560 -0.43131846  0.120436130 -0.03961992
## Number_of_Open_Accounts      -0.09853217  0.32660982  0.641716849 -0.33484611
## Current_Credit_Balance       -0.01344433 -0.02000199 -0.315681161  0.36483591
## Maximum_Open_Credit           0.05853400  0.07827771 -0.297923035  0.07542986
##                                       PC9         PC10
## Current_Loan_Amount          -0.316310269  0.001445027
## Credit_Score                 -0.313783047  0.108458262
## Annual_Income                 0.610180369  0.140943219
## Years_in_current_job         -0.001624696  0.020741613
## Monthly_Debt                 -0.577299394 -0.249048829
## Years_of_Credit_History      -0.087639247 -0.059888041
## Months_since_last_delinquent -0.039218127 -0.029091854
## Number_of_Open_Accounts       0.080439561  0.190798187
## Current_Credit_Balance       -0.054694992  0.665306097
## Maximum_Open_Credit           0.277866761 -0.649957790
# Visualization, Counting variation captured by each Principal Component
eigen_exp <- fviz_eig(PCA_bankloan)
eigen_exp

# Visualization, Principal Component as linear combination of each of the variables
eigen_contri1 <- fviz_contrib(PCA_bankloan, choice='var', axes=1)
eigen_contri1

eigen_contri2 <- fviz_contrib(PCA_bankloan, choice='var', axes=2)
eigen_contri2

eigen_contri1_2 <- fviz_contrib(PCA_bankloan, choice='var', axes=1:2)
eigen_contri1_2

# Variable contributions (in %) to each principal component
var <- get_pca_var(PCA_bankloan)
head(var$contrib)
##                             Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
## Current_Loan_Amount     16.565993  3.8116650  5.5169175  1.7925223  0.2678706
## Credit_Score             3.728527  6.1058423 49.8905338  3.1645872  9.5232763
## Annual_Income           11.515501  7.1480404 12.4965862 13.9661116  0.2961967
## Years_in_current_job     2.860297 13.4877493  0.3665087  0.3282474 56.1877916
## Monthly_Debt            16.464526  0.4421119  5.2853197 14.7661137  1.2826679
## Years_of_Credit_History  4.073818  9.1525972  1.1831989 43.7981648  0.2286577
##                              Dim.6        Dim.7      Dim.8        Dim.9
## Current_Loan_Amount      0.2569676 10.805031600 50.9776050 1.000522e+01
## Credit_Score             5.5289353  5.505306831  5.5306919 9.845980e+00
## Annual_Income           14.6270215  0.008561193  0.7234742 3.723201e+01
## Years_in_current_job    24.7484510  0.881615052  1.0960542 2.639638e-04
## Monthly_Debt             1.3825241  4.997485369 15.8492601 3.332746e+01
## Years_of_Credit_History 23.5323940 16.330189373  0.5742587 7.680638e-01
##                               Dim.10
## Current_Loan_Amount     0.0002088102
## Credit_Score            1.1763194542
## Annual_Income           1.9864990910
## Years_in_current_job    0.0430214491
## Monthly_Debt            6.2025319326
## Years_of_Credit_History 0.3586577477

With all quantitative variables, we need to include PC1-PC6 in order to account for more than 80% of the variation.
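The PC1-PC6 cutoff can also be read off programmatically from the cumulative proportions; a small sketch using the PCA_bankloan object above:

```r
# Smallest number of components whose cumulative explained variance
# exceeds 80% (6 here, per the summary above)
pve <- PCA_bankloan$sdev^2 / sum(PCA_bankloan$sdev^2)
which(cumsum(pve) > 0.80)[1]
```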

STEP 5: K-MEANS VISUALIZATION WITH EVALUATION STATISTICS

# Using balanced data
# Taking only 300 observations because the full data set is too big to visualize
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score', 

# Scale the data
bankloan_scaled <- scale(bankloan_sub)
#summary(bankloan_scaled)

# Elbow Method
elbow_figure <- fviz_nbclust(bankloan_scaled, kmeans, method='wss')
elbow_figure

# Silhouette Score
sil <- fviz_nbclust(bankloan_scaled, kmeans, method='silhouette')
sil

# Gap Statistic
gap_stat <- clusGap(bankloan_scaled, FUN=kmeans, nstart=25, K.max=10, B=50)
fviz_gap_stat(gap_stat)

# Applying kMeans
kmeans_clust <- kmeans(bankloan_scaled, 2)

# Visualize Cluster Plot
fviz_cluster(list(data=bankloan_scaled, cluster=kmeans_clust$cluster), ellipse.type='norm', geom='point', stand=FALSE, palette='jco')

# Calculating the Hopkins statistic, which indicates whether the data have an inherent clustering tendency
print(hopkins(bankloan_scaled, n=nrow(bankloan_scaled)-1))
## $H
## [1] 0.3729719
# Visualizing the dissimilarity matrix
# Pink/red is high in similarity, blue/purple is low in similarity.
fviz_dist(dist(bankloan_scaled), show_labels = FALSE) +
  labs(title='Bank Loan Status Dataset')

The Hopkins statistic is 0.37. This indicates the data have no strong cluster separability and are closer to random, which agrees with the PCA clustering in the previous section. However, supervised algorithms like k-Nearest Neighbors and Random Forest reach 98%-100% accuracy, so this data probably works well with supervised algorithms but not with unsupervised ones like PCA and k-Means. Note also that the categorical variables were included in the supervised algorithms but not here.

In terms of predicting the number of clusters, both the Elbow Method and the Silhouette score suggest 2 is optimal, while the Gap statistic suggests 1, which would mean all observations form a single group and does not make sense here.
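Because the true labels are known here, a quick sanity check is to cross-tabulate the k-means assignments against Loan_Status for the same 300 rows; a minimal sketch:

```r
# Cross-tabulate k-means cluster membership against the known loan status;
# rows are clusters, columns are the true classes
table(Cluster = kmeans_clust$cluster,
      Status  = train[1:300, 'Loan_Status'])
```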

STEP 6: HIERARCHICAL CLUSTERING

a) Using Ward Linkage Function

# Taking a subset of quantitative variables to illustrate the concepts more easily. Using only 300 observations of the balanced data because the full data set is too big to visualize
bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score', 

# Scale the data
bankloan_scaled <- scale(bankloan_sub)
#summary(bankloan_scaled)

# Creating the dissimilarity matrix using Euclidean distance
bankloan_dist <- dist(bankloan_scaled, method='euclidean')

# The linkage function uses the distance as a proximity metric and pairwise merges instances, creating larger clusters with each successive iteration. Using ward.D2, which creates clusters by minimizing variance.
# https://en.wikipedia.org/wiki/Ward%27s_method

agg_tree_ward <- hclust(d=bankloan_dist, method='ward.D2')
print(agg_tree_ward)
## 
## Call:
## hclust(d = bankloan_dist, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_ward, cex=.5)

# Cutting tree to create 2 clusters and visualizing it
agg_tree_ward_dend <- fviz_dend(agg_tree_ward, cex=.5, k=2, palette='jco')
agg_tree_ward_dend

# To assess the partition accuracy of the cluster tree (created by hclust()), there should be a strong correlation between the original distance matrix and the object linkage distances, known as cophenetic distances.
# The cophenetic correlation coefficient measures how faithfully a dendrogram preserves the pairwise distances between the original data points.

agg_cophenetic <- cophenetic(agg_tree_ward)

cor(bankloan_dist, agg_cophenetic)
## [1] 0.5038476

Using the Ward linkage function, I get a cophenetic correlation coefficient of 0.50. Although I used balanced data, the dendrogram shows one cluster roughly twice as large as the other.
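The size imbalance can be confirmed numerically by cutting the Ward tree into two clusters and tabulating membership; a minimal sketch:

```r
# Cut the Ward dendrogram into 2 clusters and count members in each
ward_groups <- cutree(agg_tree_ward, k = 2)
table(ward_groups)
```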

b) Improvement: Using Average Linkage Function

agg_tree_avg <- hclust(d=bankloan_dist, method='average')
print(agg_tree_avg)
## 
## Call:
## hclust(d = bankloan_dist, method = "average")
## 
## Cluster method   : average 
## Distance         : euclidean 
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_avg, cex=.5)

# Cutting tree to create 2 clusters and visualizing it
agg_tree_avg_dend <- fviz_dend(agg_tree_avg, cex=.5, k=2, palette='jco')
agg_tree_avg_dend

# Cophenetic Distance
agg_cophenetic <- cophenetic(agg_tree_avg)
# Correlation between Cophenetic distances and original distances
cor(bankloan_dist, agg_cophenetic)
## [1] 0.6559579
# Cutting tree into two clusters and Visualizing it
two_groups <- cutree(agg_tree_avg, k=2)
fviz_cluster(list(data=bankloan_scaled, cluster=two_groups))

Using the Average linkage function, I get a cophenetic correlation coefficient of 0.66, which is better than the Ward linkage function. However, the dendrogram and cluster plot show one cluster far larger than the other, which should not happen with balanced data. Therefore, the Average linkage function does not work well here.

c) Improvement: Using Complete Linkage Function

# Using the Complete (i.e. Maximum) linkage 
agg_tree_max <- hclust(d=bankloan_dist, method='complete')
print(agg_tree_max)
## 
## Call:
## hclust(d = bankloan_dist, method = "complete")
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 300
# Visualizing Dendrogram
fviz_dend(agg_tree_max, cex=.5)

# Cutting tree to create 2 clusters and visualizing it
agg_tree_max_dend <- fviz_dend(agg_tree_max, cex=.5, k=2, palette='jco')
agg_tree_max_dend

# Cophenetic Distance
agg_cophenetic <- cophenetic(agg_tree_max)
# Correlation between Cophenetic distances and original distances
cor(bankloan_dist, agg_cophenetic)
## [1] 0.5799617

Similar to Average linkage, the Complete (Maximum) linkage dendrogram shows one cluster far larger than the other, which again should not happen with balanced data. Therefore, the Ward linkage function works best for my data.

d) Using the Cluster package for Agglomerative and Divisive Methods

# Agglomerative (bottom-up) approach
agnes_cluster <- agnes(x=bankloan_scaled, stand=TRUE, metric='euclidean', method='average')
# Plot tree (alternative to fviz_dend)
#agnes_tree <- pltree(agnes_cluster, cex=.5, hang=-1, main='Dendrogram of Agnes')
fviz_dend(agnes_cluster, cex=.6, k=2, main='Dendrogram of Agnes')

# Divisive (top-down) approach; it does not need a linkage function
diana_cluster <- diana(x=bankloan_scaled, stand=TRUE, metric='euclidean')
fviz_dend(diana_cluster, cex=.6, k=2, main='Dendrogram of Diana')

Using the cluster package, the divisive (top-down) method gives better-balanced clusters than the agglomerative (bottom-up) method, judging by the relative sizes of the two clusters.
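The cluster package also reports coefficients that quantify this comparison: the agglomerative coefficient from agnes() and the divisive coefficient from diana(), both closer to 1 when the clustering structure is stronger. A minimal sketch using the objects fitted above:

```r
# Agglomerative coefficient (agnes) and divisive coefficient (diana);
# values near 1 indicate stronger clustering structure
agnes_cluster$ac
diana_cluster$dc
```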

e) Heatmap

library(pheatmap)

# Structure of the scaled matrix (str() prints directly, so kable() is unnecessary here)
str(bankloan_scaled)
##  num [1:300, 1:4] -0.722 0.166 0.101 0.902 -0.549 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:300] "179" "526" "195" "118" ...
##   ..$ : chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"
##  - attr(*, "scaled:center")= Named num [1:4] 1.22e+06 1.66e+04 3.67e+01 4.38e+05
##   ..- attr(*, "names")= chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"
##  - attr(*, "scaled:scale")= Named num [1:4] 495765 8637.4 22.2 265914.4
##   ..- attr(*, "names")= chr [1:4] "Annual_Income" "Monthly_Debt" "Months_since_last_delinquent" "Maximum_Open_Credit"


knitr::kable(summary(bankloan_scaled))
| Annual_Income    | Monthly_Debt      | Months_since_last_delinquent | Maximum_Open_Credit |
|:-----------------|:------------------|:-----------------------------|:--------------------|
| Min.   :-1.8834  | Min.   :-1.89061  | Min.   :-1.61011             | Min.   :-1.6458     |
| 1st Qu.:-0.7313  | 1st Qu.:-0.72261  | 1st Qu.:-0.84418             | 1st Qu.:-0.7416     |
| Median :-0.1831  | Median :-0.08331  | Median :-0.07825             | Median :-0.2352     |
| Mean   : 0.0000  | Mean   : 0.00000  | Mean   : 0.00000             | Mean   : 0.0000     |
| 3rd Qu.: 0.6428  | 3rd Qu.: 0.62319  | 3rd Qu.: 0.77780             | 3rd Qu.: 0.5653     |
| Max.   : 3.3527  | Max.   : 2.99808  | Max.   : 2.08439             | Max.   : 2.9126     |
# Heatmaps visualize hierarchical clustering of both samples (rows) and features (columns).
# High values are shown in red and low values in blue.
# cellheight can be set to a higher number to see the detail of each row (such as a row representing a gene)
pheatmap(bankloan_scaled, cutree_rows = 2, cellheight = .4)

STEP 7: COMPARING CLUSTERING ALGORITHMS

clValid reports validation measures for clustering results. The function returns an object of class "clValid", which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: "internal", "stability", and "biological".

a) Internal Validation Measures

library(clValid)

bankloan_sub <- train[1:300, c('Annual_Income', 'Monthly_Debt', 'Months_since_last_delinquent', 'Maximum_Open_Credit')] # 'Maximum_Open_Credit', 'Credit_Score', 

# Scale the data
bankloan_scaled <- scale(bankloan_sub)
knitr::kable(summary(bankloan_scaled))
| Annual_Income    | Monthly_Debt      | Months_since_last_delinquent | Maximum_Open_Credit |
|:-----------------|:------------------|:-----------------------------|:--------------------|
| Min.   :-1.8834  | Min.   :-1.89061  | Min.   :-1.61011             | Min.   :-1.6458     |
| 1st Qu.:-0.7313  | 1st Qu.:-0.72261  | 1st Qu.:-0.84418             | 1st Qu.:-0.7416     |
| Median :-0.1831  | Median :-0.08331  | Median :-0.07825             | Median :-0.2352     |
| Mean   : 0.0000  | Mean   : 0.00000  | Mean   : 0.00000             | Mean   : 0.0000     |
| 3rd Qu.: 0.6428  | 3rd Qu.: 0.62319  | 3rd Qu.: 0.77780             | 3rd Qu.: 0.5653     |
| Max.   : 3.3527  | Max.   : 2.99808  | Max.   : 2.08439             | Max.   : 2.9126     |
# Comparing the Hierarchical, k-Means, and PAM (Partitioning Around Medoids) algorithms
clmethods <- c('hierarchical', 'kmeans', 'pam')

# Using 'internal' as evaluation metric
evaluation <- clValid(bankloan_scaled, nClust=2:6, clMethods=clmethods, validation='internal')

summary(evaluation)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                   2        3        4        5        6
##                                                                        
## hierarchical Connectivity    9.0060  50.4532  53.1821  67.8536  69.2532
##              Dunn            0.1456   0.0880   0.0880   0.1051   0.1051
##              Silhouette      0.3627   0.2592   0.1871   0.1814   0.1733
## kmeans       Connectivity   67.9052 113.3679 108.8123 131.1357 167.3623
##              Dunn            0.0694   0.0516   0.0677   0.0763   0.0493
##              Silhouette      0.2841   0.2388   0.2477   0.2478   0.2338
## pam          Connectivity   78.1131 107.1226 159.7254 179.6643 205.6044
##              Dunn            0.0683   0.0563   0.0513   0.0649   0.0624
##              Silhouette      0.2552   0.2354   0.1921   0.2115   0.1873
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 9.0060 hierarchical 2       
## Dunn         0.1456 hierarchical 2       
## Silhouette   0.3627 hierarchical 2
optimalScores(evaluation)
##                  Score       Method Clusters
## Connectivity 9.0059524 hierarchical        2
## Dunn         0.1456297 hierarchical        2
## Silhouette   0.3627021 hierarchical        2
# Visualization
plot(evaluation)

Comparing the k-Means, Hierarchical, and PAM (Partitioning Around Medoids) algorithms, hierarchical clustering gives the best internal validation measures: connectivity (lower is better), Dunn index (higher is better), and Silhouette score (closer to 1 is better).
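For reference, the silhouette width of any single partition can also be computed directly with cluster::silhouette; a hedged sketch for a two-cluster hierarchical cut (clValid's internal settings may differ slightly, so the value need not match 0.3627 exactly):

```r
library(cluster)
# Average silhouette width of a 2-cluster average-linkage cut;
# values closer to 1 indicate better-separated clusters
hc <- hclust(dist(bankloan_scaled), method = 'average')
sil <- silhouette(cutree(hc, k = 2), dist(bankloan_scaled))
mean(sil[, 'sil_width'])
```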

b) Stability Measures

# Using 'stability' as validation measure
evaluation <- clValid(bankloan_scaled, nClust=2:6, clMethods=clmethods, validation='stability')

summary(evaluation)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                        2      3      4      5      6
##                                                     
## hierarchical APN  0.0079 0.0960 0.2284 0.2976 0.3032
##              AD   2.5664 2.4465 2.3417 2.2985 2.2746
##              ADM  0.1023 0.7024 0.6147 0.7919 0.7977
##              FOM  0.9811 0.9617 0.9399 0.9366 0.9338
## kmeans       APN  0.1620 0.4120 0.3055 0.3207 0.4384
##              AD   2.3116 2.2991 2.0601 1.9948 1.9841
##              ADM  0.4041 0.9948 0.7263 0.7808 0.9347
##              FOM  0.9414 0.9503 0.9350 0.9310 0.9298
## pam          APN  0.2103 0.2950 0.4282 0.4625 0.4973
##              AD   2.3402 2.1755 2.1488 2.0755 2.0263
##              ADM  0.5056 0.6582 0.9623 1.0007 1.0218
##              FOM  0.9445 0.9475 0.9359 0.9459 0.9343
## 
## Optimal Scores:
## 
##     Score  Method       Clusters
## APN 0.0079 hierarchical 2       
## AD  1.9841 kmeans       6       
## ADM 0.1023 hierarchical 2       
## FOM 0.9298 kmeans       6
optimalScores(evaluation)
##           Score       Method Clusters
## APN 0.007888128 hierarchical        2
## AD  1.984122062       kmeans        6
## ADM 0.102312826 hierarchical        2
## FOM 0.929843323       kmeans        6
# Visualization
plot(evaluation)

Comparing the k-Means, Hierarchical, and PAM (Partitioning Around Medoids) algorithms, hierarchical clustering gives the best stability measures in terms of APN and ADM, predicting 2 as the optimal number of clusters. k-Means gives the best stability in terms of AD and FOM, predicting 6 clusters. Hierarchical clustering is therefore the better prediction, since I know the data has only two classes.

STEP 8: HYPOTHESIS TESTING FOR K-NN AND RANDOM FOREST

a) k-Nearest Neighbors Algorithm

set.seed(9650)

# Divide data into train and test sets
indexTrain <- createDataPartition(y=bankloan$Loan_Status, p=0.75, list=FALSE) 
training <- bankloan[indexTrain, ]
testing <- bankloan[-indexTrain, ]

table(training$Loan_Status)
## 
## Charged Off  Fully Paid 
##        3471       15418
# Down-sampling the majority class with caret::downSample to balance the data
down_train <- downSample(x=training[, -ncol(training)], y=training$Loan_Status)
down_test <- downSample(x=testing[, -ncol(testing)], y=testing$Loan_Status)

table(down_train$Loan_Status)
## 
## Charged Off  Fully Paid 
##        3471        3471
table(down_test$Loan_Status)
## 
## Charged Off  Fully Paid 
##        1157        1157
# Cross-validation and model fitting
fitControl <- trainControl(method='repeatedcv', number=10)
down_kNNfit <- train(Loan_Status~., data=down_train, method='knn', trControl=fitControl, preProcess=c('BoxCox', "center", "scale"), tuneLength=12)

#print output
down_kNNfit
## k-Nearest Neighbors 
## 
## 6942 samples
##   16 predictor
##    2 classes: 'Charged Off', 'Fully Paid' 
## 
## Pre-processing: Box-Cox transformation (6), centered (32), scaled (32) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6248, 6248, 6248, 6248, 6248, 6248, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9752236  0.9504474
##    7  0.9775274  0.9550549
##    9  0.9783928  0.9567858
##   11  0.9794000  0.9588000
##   13  0.9801202  0.9602406
##   15  0.9815605  0.9631212
##   17  0.9815603  0.9631207
##   19  0.9825686  0.9651372
##   21  0.9824251  0.9648502
##   23  0.9814163  0.9628325
##   25  0.9825692  0.9651384
##   27  0.9825696  0.9651393
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 27.
plot(down_kNNfit)

# Prediction
down_kNNpredict <- predict(down_kNNfit, newdata=down_test)

# confusion matrix to see accuracy and other parameter values
confusionMatrix(down_kNNpredict, down_test$Loan_Status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off        1129         19
##   Fully Paid           28       1138
##                                          
##                Accuracy : 0.9797         
##                  95% CI : (0.9731, 0.985)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9594         
##                                          
##  Mcnemar's Test P-Value : 0.2432         
##                                          
##             Sensitivity : 0.9758         
##             Specificity : 0.9836         
##          Pos Pred Value : 0.9834         
##          Neg Pred Value : 0.9760         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4879         
##    Detection Prevalence : 0.4961         
##       Balanced Accuracy : 0.9797         
##                                          
##        'Positive' Class : Charged Off    
## 
# calculate accuracy
mean(down_kNNpredict == down_test$Loan_Status)
## [1] 0.9796889

b) Random Forest Algorithm

set.seed(9650)

# Number of variables randomly sampled as candidates at each split
mtry <- sqrt(ncol(training))

# Setting the mtry value. Note: caret's argument is spelled `tuneGrid`;
# the misspelled `tunegrid` passed to train() below is silently ignored,
# so caret falls back to its default grid (mtry = 2, 17, 32 in the output).
tunegrid <- expand.grid(mtry=mtry)

# Cross Validation for parameter tuning
fitControl <- trainControl(method='repeatedcv', number=10, repeats=1)

# train the model
RFfit_down <- train(Loan_Status~., data=down_train, method='rf', ntree=100, tunegrid=tunegrid, trControl=fitControl, preProcess=c('BoxCox', "center", "scale"))

#print output
RFfit_down
## Random Forest 
## 
## 6942 samples
##   16 predictor
##    2 classes: 'Charged Off', 'Fully Paid' 
## 
## Pre-processing: Box-Cox transformation (6), centered (32), scaled (32) 
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 6248, 6248, 6248, 6248, 6248, 6247, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9989918  0.9979835
##   17    1.0000000  1.0000000
##   32    1.0000000  1.0000000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
plot(RFfit_down)

# Prediction
RFpredict_down <- predict(RFfit_down, newdata=down_test)

# confusion matrix to see accuracy and other parameter values
confusionMatrix(RFpredict_down, down_test$Loan_Status)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Charged Off Fully Paid
##   Charged Off        1157          0
##   Fully Paid            0       1157
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9984, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0        
##             Specificity : 1.0        
##          Pos Pred Value : 1.0        
##          Neg Pred Value : 1.0        
##              Prevalence : 0.5        
##          Detection Rate : 0.5        
##    Detection Prevalence : 0.5        
##       Balanced Accuracy : 1.0        
##                                      
##        'Positive' Class : Charged Off
## 
# compute accuracy by comparing prediction with actual classification
mean(RFpredict_down == down_test$Loan_Status)
## [1] 1

c) T-test Comparing kNN and RF

set.seed(9650)

# The predictions are factors; convert them to numeric for the t-test comparison to work
kNNpredict <- as.numeric(down_kNNpredict)
RFpredict <- as.numeric(RFpredict_down)
typeof(kNNpredict)
## [1] "double"
typeof(RFpredict_down)
## [1] "integer"
summary(kNNpredict)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   1.504   2.000   2.000
summary(RFpredict_down)
## Charged Off  Fully Paid 
##        1157        1157
# A paired t-test is applied because the same observations are scored by the two algorithms.
# alternative='less' proposes that kNN is less accurate than Random Forest
# Testing the prediction
t.test(kNNpredict, RFpredict, alternative = 'less', paired = TRUE, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  kNNpredict and RFpredict
## t = 1.313, df = 2313, p-value = 0.9053
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##         -Inf 0.008763743
## sample estimates:
## mean of the differences 
##             0.003889369
# alternative = 'two.sided', propose that two algorithms are different
t.test(kNNpredict, RFpredict, alternative = 'two.sided', paired = TRUE, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  kNNpredict and RFpredict
## t = 1.313, df = 2313, p-value = 0.1893
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.001919520  0.009698258
## sample estimates:
## mean of the differences 
##             0.003889369
# Testing the model
#t.test(down_kNNfit, RFfit_down, alternative = 'less', paired = TRUE, conf.level = 0.95)

summary(down_kNNfit)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      32     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list
summary(RFfit_down)
##                 Length Class      Mode     
## call                6  -none-     call     
## type                1  -none-     character
## predicted        6942  factor     numeric  
## err.rate          300  -none-     numeric  
## confusion           6  -none-     numeric  
## votes           13884  matrix     numeric  
## oob.times        6942  -none-     numeric  
## classes             2  -none-     character
## importance         32  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y                6942  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             32  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           2  -none-     character
## param               2  -none-     list
# The two fitted model objects cannot be compared directly with a t-test because they store different parameters.

This section applies hypothesis testing to check whether the performance difference between k-NN and Random Forest is statistically significant or due to chance.

p-value = 0.91 (>0.05) for “k-NN performs worse than RF”. The data therefore do not provide enough evidence that k-NN performs worse than RF.

p-value = 0.19 (>0.05) for “k-NN performs differently from RF”. The data therefore do not provide enough evidence that the two algorithms perform differently.

Taken together, the difference between k-NN’s 98% accuracy and RF’s 100% accuracy is attributable to chance and is not statistically significant.

STEP 9: CONCLUSION

Unsupervised algorithms are more challenging than supervised algorithms because there is no target feature to train the model on. That is why k-NN and Random Forest give 98%–100% accuracy, while PCA, k-Means and Hierarchical Clustering give much less accurate predictions on this bank loan status dataset.

Nonetheless, principal components PC1 and PC2 combined account for 71% of the variability in the data. Hierarchical clustering gives better internal validation and stability measures than k-Means and PAM. Within hierarchical clustering, the divisive method balances the two cluster sizes better than the agglomerative approach, given that balanced data were used in the model. The Silhouette score and the Elbow method both suggest 2 clusters, while the Gap statistic incorrectly suggests 1.

Hypothesis testing shows that the difference between k-NN’s 98% accuracy and RF’s 100% accuracy is due to chance and is not statistically significant. This is a noteworthy result.

A weakness of this dataset is that the two clusters do not separate well (Hopkins statistic = 0.37). This is probably because the data are noisy or simply not well suited to unsupervised algorithms. Since the inherent structure of the data cannot be changed, the PCA, k-Means and hierarchical clustering analyses were carried out even though the data are not ideal.
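The Hopkins statistic can be computed with, for example, `factoextra::get_clust_tendency()`. A sketch, assuming `X` (a hypothetical name) is the scaled numeric feature matrix used for clustering; the exact value varies with the random sample drawn, and under this function's convention values near 0.5 indicate little clustering tendency:

```r
library(factoextra)
set.seed(123)
# Sample a subset of rows to keep the computation fast
tend <- get_clust_tendency(X, n = 100, graph = FALSE)
tend$hopkins_stat
```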

In the future, it would be helpful to find or develop an unsupervised algorithm that improves the prediction results, since prediction was excellent when a target variable was provided.